GGPLOT2 Tutorial

Introduction

This is a tutorial for using R and ggplot2 to make publication-ready plots for use in the social sciences.1 The tutorial assumes moderate familiarity with R and base functions. I primarily made this as a learning tool for my POSC 3003/6003 courses at Arkansas State University. As such, my focus is on applications relevant to political scientists, but I suspect this could be of interest to many others interested in R graphics. The associated video vignette (for students in my course) provides additional details and a walkthrough of every aspect of the code provided here. We will be using built-in (or generated) datasets for this entire document, as I want this to remain as self-contained as possible. I hope this document can serve both as a guide for learning ggplot2, along with some tidyverse operations, and as a long-term resource on how to make a given graph. Thanks go to the many, many folks who have provided their code openly and freely for the rest of us to learn from. All credit goes to them.

I should note that I am far from the best expert on this topic, and there are likely better tutorials out there. It is also worth noting that there are many plot types I will not include. Finally, I will not be covering every single possible aspect of plot customization. In any case, I hope some folks find this useful. If you find issues or have suggestions please hit me up on GitHub (@cwimpy) or via e-mail.

I move along in a series of sections. First I introduce ggplot2 for the one reader who has never heard of it. Then I show a series of examples for getting started, along with important features. After that I show how to make a plot look nice before moving into as many examples as I can conjure. I am always adding to this section. Finally, I conclude by showing some extensions and other helpful tips, such as exporting plots and using packages written to deal with special ggplot2 problems.

The source code for this document can be found on my GitHub here: https://github.com/cwimpy/ggplot2-tutorial. Please also use it if you have suggestions or improvements!

The Grammar of Graphics

Now let’s move on to learning about ggplot2. ggplot2 is an R graphics package developed by Hadley Wickham to implement the [grammar of graphics](http://www.amazon.com/The-Grammar-Graphics-Statistics-Computing/dp/0387245448) system of data visualization proposed by Leland Wilkinson.2 The general idea is that all plots should share some basic features that make them interchangeable in the same way, no matter what is being plotted. Think of it as a language for reading and writing plots. In practice, ggplot2 has become the gold standard for plotting statistical data in most statistics-heavy disciplines such as biostatistics, economics, political science, and statistics. The basic ggplot2 implementation of the grammar of graphics is summarized as follows:

  • a default dataset with aesthetic mappings,
  • one or more layers, each with a geometric object (“geom”), a statistical transformation (“stat”), and a dataset with aesthetic mappings (possibly defaulted),
  • a scale for each aesthetic mapping (which can be automatically generated),
  • a coordinate system, and
  • a facet specification.

We will tackle the meaning of these items in practical terms as we proceed. The basic idea is that a plot can (perhaps should) be built in layers and should consist of a core set of items in order to adhere to the grammar of graphics.
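To make the list above a bit more concrete, here is a minimal sketch that spells out each component in a single call. The choice of variables, the axis name, and the faceting by cyl are my own assumptions for illustration; most of these pieces are normally left at their defaults.

```r
library(ggplot2)

grammar_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # default dataset and aesthetic mappings
  geom_point(stat = "identity") +                       # one layer: a point geom with the identity stat
  scale_x_continuous(name = "Weight") +                 # a scale for the x aesthetic
  coord_cartesian() +                                   # a coordinate system
  facet_wrap(~ cyl)                                     # a facet specification (one panel per cyl value)

grammar_plot
```

Dropping the scale, coordinate, and stat pieces should give the same kind of plot, since ggplot2 fills them in automatically.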

Getting Started

We will use the mtcars dataset for many of our example plots. Let’s have a look at what is in those data. Note that I loaded the tidyverse package quietly at the beginning of this document and that it includes ggplot2 and other tools we will be using throughout (dplyr, for example).
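The listing below has the shape of output from glimpse(), so here is a minimal sketch of the setup, assuming that is the call that produced it and that the tidyverse is installed. Newer versions of dplyr label the first two lines "Rows" and "Columns" rather than "Observations" and "Variables".

```r
# Load ggplot2, dplyr, and friends without the startup messages
suppressPackageStartupMessages(library(tidyverse))

# Compact overview of the built-in mtcars data, one line per variable
glimpse(mtcars)
```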

## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 18…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.1…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1,…

Now we can draw a blank plot where our mapped aesthetics are miles per gallon (mpg) and car weight (wt).
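A minimal sketch of such a blank plot follows. Putting wt on the x-axis and mpg on the y-axis is my assumption here; the sentence above only says that both are mapped.

```r
library(ggplot2)

# Data plus aesthetic mappings, but no geom layer yet:
# ggplot2 draws the axes and grid for wt and mpg and nothing else
blank_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

blank_plot
```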

Now we can get started with ggplot(), the primary (and necessary) engine underlying the ggplot2 package.
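As a first working example, here is a sketch that hands ggplot() the data and mappings and then adds one layer. The choice of geom_point() for a scatterplot is an assumption on my part; any geom could serve as the first layer.

```r
library(ggplot2)

# ggplot() sets up the data and mappings; geom_point() adds the points
first_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

first_plot
```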

Aesthetics

A Note on Coding Style for ggplot2

Before moving further, I want to make sure that the code structure is clearly understood. Experienced R users (or otherwise experienced programmers) will likely struggle with this far less than I did for the longest time. Let’s first look at the core aspects of the ggplot() code structure:
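A minimal sketch of that structure, using a made-up data frame so the placeholder names are runnable. The object data and its columns x and y are hypothetical and stand in for whatever you are plotting.

```r
library(ggplot2)

# Hypothetical toy data frame; the names data, x, and y are placeholders
data <- data.frame(x = 1:10, y = (1:10)^2)

# Fully spelled-out call: the function, the data argument, the aesthetic mapping
ggplot(data = data, mapping = aes(x = x, y = y))

# Equivalent shorthand that relies on argument order
ggplot(data, aes(x, y))
```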

The first part of what we see here is the ggplot() function itself. The second data, x, and y (the ones to the right of each equals sign) refer to the data frame and variables in our own data. Inside the outer set of parentheses we first point the plotting function to our data (this needs to be an R data frame, or a tibble, that is available in the workspace). This is the first place where we see major variations in code. Some people will leave off the data = part and just type the name of the data object as a shortcut. I do this as well, but it can be confusing for folks new to ggplot2. The next block is for mapping our aesthetics. We rarely see the mapping = part of the code, and we can again just shortcut things by using aes(). Within the aes() parentheses, we then map our x and y variables by following the same convention. That is, we either say aes(x = x, y = y) or simply aes(x, y). Most of the remaining parts of the function typically need to be specified by name, if for nothing else than human readability. We will get to those later.

The next place we see major variation in code is whether folks assign their plot to an object or just run the code outright. This is not a huge distinction for actually making a plot when using well-structured code, but how we read the code certainly changes. Since R is an object-oriented programming language, we can assign the whole plot to an object just like we can a dataset, variable, function, etc. The best way to see how this works is with an example that returns to where we were above.
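Here is a minimal sketch of the two styles, again assuming wt on the x-axis and mpg on the y-axis. The object name p is arbitrary.

```r
library(ggplot2)

# Option 1: run the code outright; the plot is drawn immediately
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Option 2: store the plot in an object; nothing is drawn yet
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Layers can be added to the stored object later
p <- p + geom_point()

# Printing the object (or typing its name at the console) draws it
print(p)
```

Storing the plot makes it easy to reuse the same base with different layers or themes, at the cost of one more name to keep track of.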

Finally, with the advent of the so-called pipe (%>%) operator from the magrittr package and later dplyr, among others, we will sometimes see folks do this:
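A minimal sketch of the piped form, assuming magrittr is installed (loading dplyr or the tidyverse would supply the same operator). The variables are my choice for illustration.

```r
library(ggplot2)
library(magrittr)  # supplies %>%; dplyr or tidyverse would also work

# The data frame on the left becomes the first argument of ggplot()
mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()
```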

Indeed, I do this on occasion. Do not be scared off by the code, however, as we are just replacing our ggplot(data = data, ...) with data %>% ggplot(...) and thus it is a simple reordering of the code. Since ggplot2 layers are added with + rather than piped, the pipe cannot be used beyond this point in the plotting code, so it is really just a niche habit and not particularly useful otherwise.3

Geometric Shapes (geom) and Statistical Transformations (stat)

Scales

Facets

Making Your Plot Look Good

There are many, many opinions on how plots should look. As such, the style we end up with here will mainly be my own preference. Hopefully between this and the next section I can give you the tools you need to make the plots look as you would like them to. This is also where the library of options within ggplot2 explodes, so I am certainly not calling this a comprehensive treatment of plot improvement beyond the default.4

Building Your Own Theme

Extensions

One of the most amazing aspects of ggplot2 is the sheer number of extensions that smart folks have created for it. In and of itself, ggplot2 solves more plotting problems than almost any other plotting package I know of (within R or not). When you add the literally dozens of additional extensions (usually prefixed with gg) it becomes a tour de force in data visualization. The place to start is the extensions gallery on the main ggplot2 website. Some really incredible examples include gganimate for making animated GIFs, ggthemes for some ready-made additional themes (including one that makes your graph look like it was made in default Stata!), ggforce for some serious added functionality, and ggTimeSeries for time-specific graphs. There are, of course, many more that are worth exploring.

gganimate

ggthemes

ggforce

ggTimeSeries

It also warrants mentioning that many packages return ggplot2 objects from their own flavor of graph-making functions. The syntax ends up being different in most cases to get started, but the output is a ggplot2 object that can then be manipulated. One of my favorites in this regard is the sjPlot package from [strengejacke](https://github.com/strengejacke), which I use for plotting Likert scales. Others include urbnmapr from the Urban Institute for making easy-peasy U.S. state maps (who wants to do that anyway?), choroplethr also for maps, and plotly (through its ggplotly() function) for interactive graphics.
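To illustrate what "can then be manipulated" means, here is a sketch using a hypothetical helper in place of a real package function. plot_weight_mpg() is made up for this example and does not come from sjPlot or any other package; the point is only that whatever returns a ggplot2 object can be extended with the usual + syntax.

```r
library(ggplot2)

# Hypothetical stand-in for a package function that builds a plot for you
# and hands back a ggplot2 object (not a real function from any package)
plot_weight_mpg <- function(df) {
  ggplot(df, aes(x = wt, y = mpg)) +
    geom_point()
}

# The returned object accepts further ggplot2 labels and themes
p <- plot_weight_mpg(mtcars) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()

p
```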

sjPlot

Where to Learn More

There is no shortage of other ggplot2 tutorials available online. Just snoop around until you find one you like via your search engine of choice. There are, however, a few stops I recommend along the way:

References

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. New York: Springer.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol: O’Reilly Media.

Wilkinson, Leland. 2006. The Grammar of Graphics. New York, NY: Springer Science & Business Media.


  1. The tidyverse website is a wonderful place to spend your time.

  2. If you really want to learn about this I suggest buying the respective books from Wickham, Wilkinson, and possibly others (Wilkinson (2006), Wickham (2016), Wickham and Grolemund (2017)). The R for Data Science book is also a great resource. I do not know these authors but highly respect their work (obviously, that’s the point of all this!).

  3. Note that you need to load a package that provides the pipe in order to use it, as ggplot2 does not load it automatically.

  4. Years ago when I was a graduate student you wanted to use the default to show you could use R and ggplot2. These days, since everyone is doing it, the goal is often to move as far from the default as possible.

Cameron Wimpy

2019-05-15 (updated: 2019-05-21)